On b-bit min-wise hashing for large-scale regression and classification with sparse data
Abstract
Large-scale regression problems where both the number of variables, p, and the number of observations, n, may be large and in the order of millions or more, are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns, and then work with this compressed data. b-bit min-wise hashing [Li and König, 2011, Li et al., 2011] is a promising dimension reduction scheme for sparse matrices. In this work we study the prediction error of procedures which perform regression in the new lower-dimensional space after applying the method. For both linear and logistic models we show that the average prediction error vanishes asymptotically as long as q‖β∗‖₂²/n → 0, where q is the average number of non-zero entries in each row of the design matrix and β∗ is the coefficient vector of the linear predictor. We also show that ordinary least squares or ridge regression applied to the reduced data amounts, in a sense, to a form of non-parametric regression and can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied before the signal is linear in the predictors.
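To make the pipeline concrete, here is a minimal sketch, not taken from the paper, of applying b-bit min-wise hashing to a sparse binary design matrix and then fitting ridge regression on the reduced data. The helper b_bit_minwise_hash, the use of explicit random permutations as hash functions, and all parameter values and toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

def b_bit_minwise_hash(X, num_hashes=50, b=2, seed=0):
    """Map a sparse binary design matrix X (n x p) to an n x (num_hashes * 2**b)
    matrix of one-hot encoded b-bit min-wise hashes.

    Hypothetical helper for illustration only: each hash function is simulated
    by an explicit random permutation of the column indices.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X = sparse.csr_matrix(X)
    Z = np.zeros((n, num_hashes * 2 ** b))
    for k in range(num_hashes):
        perm = rng.permutation(p)                      # one "hash function" per pass
        for i in range(n):
            cols = X.indices[X.indptr[i]:X.indptr[i + 1]]
            if cols.size == 0:
                continue                               # empty row: leave all zeros
            min_val = perm[cols].min()                 # min-wise hash of row i
            low_bits = min_val % (2 ** b)              # keep only the lowest b bits
            Z[i, k * 2 ** b + low_bits] = 1.0          # one-hot encode the b-bit value
    return Z

# Toy usage: sparse binary X with roughly q non-zeros per row, a linear signal,
# and ridge regression fitted on the reduced design matrix.
n, p, q = 200, 1000, 20
rng = np.random.default_rng(1)
rows = np.repeat(np.arange(n), q)
cols = rng.integers(0, p, size=n * q)
X = sparse.csr_matrix((np.ones(n * q), (rows, cols)), shape=(n, p))
X.data[:] = 1.0                                        # keep entries binary even if duplicates were summed
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)

Z = b_bit_minwise_hash(X, num_hashes=50, b=2)
print("reduced dimension:", Z.shape[1])
model = Ridge(alpha=1.0).fit(Z, y)
print("training MSE on reduced data:", np.mean((model.predict(Z) - y) ** 2))
```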
Similar resources
Minwise hashing for large-scale regression and classification with sparse data
We study large-scale regression analysis where both the number of variables, p, and the number of observations, n, may be large and in the order of millions or more. This is very different from the now well-studied high-dimensional regression context of “large p, small n”. For example, in our “large p, large n” setting, an ordinary least squares estimator may be inappropriate for computational,...
b-Bit Minwise Hashing for Large-Scale Learning
Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and ...
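The integration with a linear learner sketched above relies on expanding each b-bit hash value into an indicator vector of length 2^b, so that the model stays linear in the expanded features. The toy sketch below, with made-up hash values and labels standing in for real data, shows that expansion feeding scikit-learn's LinearSVC; it is an assumption-laden illustration, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Suppose each of n samples has already been summarised by k b-bit min-wise
# hash values in {0, ..., 2**b - 1}; here they are random stand-ins.
b, k, n = 2, 30, 200
rng = np.random.default_rng(0)
hash_values = rng.integers(0, 2 ** b, size=(n, k))    # placeholder b-bit hashes
y = rng.integers(0, 2, size=n)                        # placeholder binary labels

# Expand every b-bit value into a 2**b-dimensional indicator vector.
Z = np.zeros((n, k * 2 ** b))
rows = np.repeat(np.arange(n), k)
cols = (np.arange(k) * 2 ** b + hash_values).ravel()  # per-hash column offset + value
Z[rows, cols] = 1.0

clf = LinearSVC(max_iter=5000).fit(Z, y)              # any linear learner would do here
print("training accuracy:", clf.score(Z, y))
```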
b-Bit Minwise Hashing for Large-Scale Linear SVM
Linear Support Vector Machines (e.g., SVM, Pegasos, LIBLINEAR) are powerful and extremely efficient classification tools when the datasets are very large and/or high-dimensional, which is common in, e.g., text classification. Minwise hashing is a popular technique in the context of search for computing resemblance similarity between ultra high-dimensional (e.g., 2^64) data vectors such as document...
Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)
Our recent work on large-scale learning using b-bit minwise hashing [21, 22] was tested on the webspam dataset (about 24 GB in LibSVM format), which may be way too small compared to real datasets used in industry. Since we could not access the proprietary dataset used in [31] for testing the Vowpal Wabbit (VW) hashing algorithm, in this paper we present an experimental study based on the expand...
Compressed Image Hashing using Minimum Magnitude CSLBP
Image hashing allows compression, enhancement or other signal processing operations on digital images, which are usually acceptable manipulations. Cryptographic hash functions, in contrast, are very sensitive to even single-bit changes in an image. An image hash is a sum of important quality features in quantized form. In this paper, we propose a novel image hashing algorithm for authentication which i...
Publication date: 2016